arXiv:1505.08052v4 [stat.ML] 15 Oct 2015 


Batch Bayesian Optimization via Local Penalization 


Javier Gonzalez Zhenwen Dai 

University of Sheffield University of Sheffield 


Abstract 

The popularity of Bayesian optimization 
methods for efficient exploration of param¬ 
eter spaces has lead to a series of papers ap¬ 
plying Gaussian processes as surrogates in 
the optimization of functions. However, most 
proposed approaches only allow the explo¬ 
ration of the parameter space to occur se¬ 
quentially. Often, it is desirable to simulta¬ 
neously propose batches of parameter values 
to explore. This is particularly the case when 
large parallel processing facilities are avail¬ 
able. These could either be computational 
or physical facets of the process being opti¬ 
mized. Batch methods, however, require the 
modeling of the interaction between the dif¬ 
ferent evaluations in the batch, which can be 
expensive in complex scenarios. We investi¬ 
gate this issue and propose a highly effective 
heuristic based on an estimate of the func¬ 
tion’s Lipschitz constant that captures the 
most important aspect of this interaction— 
local repulsion—at negligible computational 
overhead. A penalized acquisition function 
is used to collect batches of points minimiz¬ 
ing the non-parallelizable computational ef¬ 
fort. The resulting algorithm compares very 
well, in run-time, with much more elaborate 
alternatives. 


1 Introduction 

Many problems, such as the configuration of machine 
learning algorithms [Snoek et ah, 2012] or the ex¬ 
perimental design of biological experiments [Gonzalez 
et ah, 2014] require the optimization of an unknown. 
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possibly noisy, function /. Bayesian optimization 
(BO) has emerged in this scenario as an efficient 
heuristic to optimize / if function evaluations are 
costly and the overall number of evaluations must be 
kept low [Jones et ah, 1998]. 

The task is to solve the global optimization problem 
of finding 

XM = argmax/(x). (1) 

We assume that / is a black-box from which only per¬ 
turbed evaluations of the type pi = /(x^) + Ci, with 
Ci ~ A/’(0, cr^), are available. We will assume that 
the objective of interest can be described well by a 
L-Lipschitz continuous function / : A —> IR defined on 
a compact subset A C IR'’*. 

In sequential BO the goal is to make a series of eval¬ 
uations xi,..., X]v of / such that the maximum of 
/ is evaluated as quickly as possible. After n points 
are available, BO proposes a new location x„_|_i using 
a probabilistic model for /, conditioned on all previ¬ 
ous observations = {(x^,Typically the 
model is a Gaussian process (GP) p{f) = G'P{n; k) 
with mean function p, and positive-definite covariance 
function (kernel) k that in this work we assume is sta¬ 
tionary. Under Gaussian likelihoods, the posterior dis¬ 
tribution of / (for a sample of size n) is also a GP, with 
posterior mean and variance given by 

/x„(x*) = k„(x*)^[K„ -h cr^I]"V„ 

and 

al{x*) = fc(x*,x*) - k„(x*)^[K„ + a2l]-ik„(x*), 

where K„ is the matrix such that (K„)jj = k{‘Xi,Xj), 
k„(x*) = [fc(xi, X*),..., fc(x„, x*)]^ [Rasmussen and 
Williams, 2005] and x* is the point where the GP is 
evaluated. 

This posterior is used to form the acquisition func¬ 
tion a(x;2„), where represents the available data 
set T>n and the GP structure (kernel, likelihood and 
parameter values) when n data points are available. 
The next evaluation is placed at the (numerically esti¬ 
mated) global maximum x„+i of this acquisition func¬ 
tion. A number of possible acquisition functions are 


now available, ranging from fast heuristics [Osborne, 
2010, Jones et ah, 1998] to non-local entropy-based 
approaches [Hennig and Schuler, 2012, Hernandez- 
Lobato et ah, 2014]. 

While the goal of Bayesian optimization is to keep the 
number of evaluations of / as low as possible, in high¬ 
dimensional and or otherwise complex problems, the 
number of required evaluations can still be consider¬ 
able. Parallel approaches arise as the natural solution 
to circumvent the computational bottleneck around 
these evaluations of /. We focus on cases in which the 
cost of evaluating / in a batch of points of size Ub is the 
same as evaluating / in a single point. Such scenarios 
appear, for instance, in the optimization of computer 
models where several cores are available to run in par¬ 
allel, or in wet-lab experiments when the cost of testing 
one experimental design is the same as testing a batch 
of them. In these settings, the set of available pairs 
{(xi, can be augmented with the evaluations 

of / on batches of data points = {xj^i,..., Xt_„f,}, 
for t = l,...,m, rather than on single observations. 
Our goal here is to define a design rule for such batches 
Bi'’ ,..., B^. The batch selection problem can be gen¬ 
eralized further, e.g. by adapting the batch size [Azimi 
et ah, 2012] or by collecting batches asynchronously 
[Ginsbourger et ah, 2011, Janusevskis et ah, 2012, 
Snoek et ah, 2012]. For simplicity of exposition these 
ideas will not feature further here. 

1.1 Optimal batch design and previous work 

The goal of any batch criterion is to mimic the deci¬ 
sions that would be made under the equivalent (opti¬ 
mal) sequential policy: Consider the choice of select¬ 
ing Xt fc, the fc-th element of the t-th batch. Under 
a sequential policy, in which the evaluations of / at 
all locations prior to Xj^^ are available, the decision is 
to take xt^fc as the maximizer of In the 

batch case, the decision about where to collect xj^fc 
has to incorporate the uncertainty about the locations 
xt_i,..., Xi_fe_i, and the outcomes of the evaluation of 
/ there. Iteratively marginalizing these sources of un¬ 
certainty gives 

xt.fc = argmax / a{yL]It^k-i) (2) 

xea: J 

k-l 

]Jp(2/t.j|xty,Tty-i)p(x(jl2'tj_i)dxtjdytj, 

i=i 

where 

= A/” {ytj] Mn(Xtj), a’„(xtj)) 

is the predictive distribution of the GP at xtj when a 
total of n points are available and 

= J(xtj - argmaxa(x;2'tj_i)) 


reflects the optimization step required to obtain Xjj 
after the evaluations of / at previous batch-elements 
have been marginalized. 

The optimization in Eq. (2) is intractable even 
for small batch-sizes, due to the optimization- 
marginalization loop required to obtain Xj fc. The lit¬ 
erature in batch BO has tried to avoid this computa¬ 
tional burden by means of different strategies, most 
of which involve the explicit use of the predictive dis¬ 
tributions p(ytjlx(j,2:tj_i), for j = l,...,n6. Ex¬ 
ploratory approaches [Schonlau et ah, 1998, Cental 
et ah, 2013] search for a reduction in system uncer¬ 
tainty. This is using the property that the variance of 
does not depend on the value of the 
objective there. Other methods use p{ytj\xt j,It j_i) 
to generate ‘fake’ observations of the model [Azimi 
et ah, 2012, 2011, Bergstra et ah, 2011] and avoid 
the marginalization step. In statistics, the suitability 
of the expected improvement utility has been stud¬ 
ied for the design of batches [Chevalier and Gins¬ 
bourger, 2013, Frazier, 2012]. In contrast to the pre¬ 
vious mentioned works, these methods use the joint 
distribution of ytj,... yt,nb to simultaneously optimize 
elements on the batch [Azimi et ah, 2010]. These non- 
greedy strategies are very well founded from a theo¬ 
retical perspective in practice but tend to scale poorly 
with the dimension of the problem and the sizes of 
the batches. Other theoretical properties of batch BO 
have been studied in the context of Bayesian networks 
[Oenek and Schwarz, 2000], multi-armed bandits [De- 
sautels et ah, 2012], and the optimal balance between 
exploration and exploitation in batch designs [Jalali 
et ah, 2013]. 

1.2 Goal and Contributions of this work 

Using p(ytj\xtj,2tj-i) to model the interaction be¬ 
tween batch elements has a computational overhead of 
0{n^), since the GP needs to be updated after each 
batch location is selected to jointly optimize all the el¬ 
ements in the batch. The motivation of this work is to 
develop a heuristic approximation of Eq. (2) at lower 
computational cost, while incorporating information 
about global properties of / from the GP model into 
the batch design. 

Our approach rests on the hypothesis that / is a Lip- 
schitz continuous function, which is a common as¬ 
sumption in global optimization [Eloudas and Parda- 
los, 2009]. Eor easy reference: a real-valued function 
/ : A —)■ IR on a compact subset X C IR'^ of the d- 
dimensional real vector space is said to be L-Lipschitz 
if it satisfies 

l/(xi) -/(X2)| < L||xi-X2II, Vxi,X2GA ( 3 ) 
where A is a global positive constant, and || • || is the 



batch BO methods in terms of the convergence to the 
maximum, but shows better performance in terms of 
gained information per second. 



X 


Figure 1: Forrester function f{x) = (6a; —2)^ sin(12a; — 
4) in the interval [0.3,01,4]. We take 6 evaluations 
xi,..., Xg of the function, M = max^ /(x^) and L = 
400. The exclusion zones for the maximum of / deter¬ 
mined by the balls are shown. 

.^^-norm on IR'^, a property that has been previously 
exploited in global optimization [Horst and Pardalos, 
1995, Strongin and Sergeyev, 2000]. 

In the context of parallelizing Bayesian optimization, a 
beneficial aspect of the Lipschitzian assumption is that 
it naturally allows us to place bounds on how far the 
optimum of / is from a certain location. See Figure 1 
for details. As explained below, this information can 
be used to define policies to collect a batch of points 
multiple steps ahead without evaluating /, by mimick¬ 
ing the hypothesized behavior of a sequential policy. 
The main challenge is that, in practice, the constant L 
is unknown. In the literature, this problem has been 
addressed from different angles [Floudas and Parda¬ 
los, 2009]. We explore a new alternative: inferring the 
Lipschitz constant directly from the Gaussian process 
model for /. 

Our contributions are: (i) A new batch BO heuristic, 
BBO-LP, that selects batches of points by an iterative 
maximization-penalization loop around the the acqui¬ 
sition function. This leads to efficient parallelization 
of BO and can be used with any acquisition function, 
(ii) A probabilistic framework to approximately infer 
the Lipschitz constant of /, termed GP-LCA, that uses 
the properties of the gradients of the GP. The inferred 
value of L is used to improve batch selection, (iii) 
A python implementation of several batch BO meth¬ 
ods is published in conjunction with this work.^ (iv) 
Gonfirmation of the effectiveness of the approach is 
demonstrated through several simulated experiments, 
an algorithm configuration problem, and a real wet-lab 
experimental design. In particular, the local penaliza¬ 
tion approach performs equal or better than current 


'^http://shefiieldml.github.io/GPyOpt/ 


2 Maximization-Penalization Strategy 
for Batch Design 

The intuition behind our approach is that for most GP 
priors in practical use for BO, the dominant effect of 
a function evaluation on the acquisition function is a 
local exclusion around the new evaluation. This shape 
of the acquisition function will be modeled through the 
Lipschitz properties of /, to distribute the elements in 
each batch. This should be understood as a heuristic 
to the shape of a{x;2t,k-i) if all previous observations 
were available, mimicking the effect a sequential pol¬ 
icy. This is especially useful in cases in which the ac¬ 
quisition function shows multi-modal shape, a common 
situation in the first iterations of BO algorithms. The 
following definition is helpful for the formalization of 
the algorithm: 

Definition 1 A function (p{x;Xj), x G X, is a lo¬ 
cal penalizer of a generic acquisition function Q!(x) at 
Xj */y)(x;Xj) is differentiable, 0 < 93(x;xj) < 1 and 
is an non-decreasing function in ||x —XjH. 

We propose to replace the maximization- 
marginalization loop in Eq. 2 by a maximization- 
penalization strategy: while the optimization is 
carried out in a similar fashion, the marginaliza¬ 
tion step is replaced by the direct penalization of 
q;(x;I( fc_i) around its most recent maximum, i.e, the 
previous batch element. Figure 2 gives a graphical 
illustration. The maximization-penalization strategy 
selects xj^fe as 

fc-i 'I 

n^(/5(x;xt,j) I , (4) 

where (/3(x;xtj) are local local penalizers centered at 
Xt j and g : IR —>■ IR^ is a differentiable transformation 
of q;(x) that keeps it strictly positive without chang¬ 
ing the location of its extrema. We will use g{z) = z if 
q;(x) is already positive and the soft-plus transforma¬ 
tion g{z) = In(l-l-e^) elsewhere. This does not require 
re-estimation of the GP model after each location is 
selected, just a new optimization of the penalized util¬ 
ity. 

The effect of a local penalizer is to smoothly reduce the 
value of the acquisition function in a neighborhood of 
Xj. A ‘good’ local penalizer centered at Xj should re¬ 
flect the belief about the distance from Xj to xm- If 
we suspect that xm is far from Xj, a broad (p(x;xj) 
will discard a large portion of X in which we don’t 


xt,fc = argmax < g{a{x;Itft)) 









1st batch element 2nd batch element 



3th batch element 



Figure 2: Illustration of three iterations of the maximization-penalization loop. The main task of good batch 
design is to explore the modes of the acquisition function, achieved by iterative maximization (black stars) and 
penalization (using (^i(x), (p 2 (x)) of the acquisition function q;(x). 


need to collect any sample. On the other hand, if we 
believe that xm and Xj are close, ideally we want to 
minimize the penalization of a(x) and keep collecting 
samples is a close neighborhood. This local penaliza¬ 
tion mimics the acquisition function’s dynamics under 
a sequential policy in the following sense: the modes 
of the acquisition functions correspond to regions in 
which either /i„(x) or (T^(x) (or both) are large. Eval¬ 
uating, for instance, where crn(x) is large will reduce 
uncertainty in that region, decreasing q;(x) in a neigh¬ 
borhood. The functions :/7(x;Xj) are surrogates for 
this neighborhood. 

2.1 Choosing Local Penalizers (/3(x;xj) 

We now construct penalizing functions :p(x;xj) that 
incorporate into a{x) the current belief about the dis¬ 
tance from the batch locations to xm- Take M = 
maxxGA' /(x), and a valid Lipschitz constant L. Con¬ 
sider the ball 

Br, (xj) = {x e A’ : ||Xj - x|| < Tj} (5) 

where 

M-/(x,) 

L 

To simplify the notation we write Vj = r{x.j) for the 
radius of the ball around Xj. If / in (5) is the true 
optimization objective, then xm ^ Br^ (x)—otherwise 
the Lipschitz condition would be violated. The size of 
Br^{xj) depends on L, M and the value of / at Xj. 
Both large variability in / (large L) and proximity of 
/(xj) to the optimum M shrink Br-{'X-j). 

In the BO context, under the assumption /(x) ^ 
QV{p{x), fc(x, x')), we choose xj) as the probabil¬ 
ity that X, any point in X that is a potential candidate 


to be a maximum, does not belong to Bj..{xj)\ 

= 1 - p(x e Br^{Xj)). (6) 

The following proposition (proof in Supp. Materi¬ 
als A) shows that this local penalizer can be computed 
in closed form. 

Proposition 1 Let /(x) he a QV with posterior mean 
/i„(x) and posterior variance cr^(x). The function 
ip{x; Xj) in Eq. (6) is a valid local penalizer of alx) at 
Xj such that: 

<P(x;Xj) = ier/c(-z) 

^ ^ “ ^11 “ ^ , 
for erfc the complementary error function, M = 
maxxGA' /(x) and L a valid Lipschitz constant. 

The functions ip(x]Xj) thus create exclusion zones 
whose size is governed by L. If p-ni^j) is close to M, 
then ip(x', Xj) will have a smaller and more localized ef¬ 
fect on a(x) (a smaller exclusion area). On the other 
hand, if /x„(xj) is far from M, ip(x;Xj) will produce a 
wider yet less intense correction on Q!(x). The value of 
L also affects the size of the effect of (p{x-, xj) on a(x), 
decreasing it as L increases. 

2.2 Selecting the parameters L and M 

The values of M and L are unknown in general. To 
approximate M, one can take M = max;^ ^„(x) or, to 
avoid solving this maximization problem, use the even 
rougher approximation M = maxi{yi}. 

Regarding the parameter L note that the definition 
of Lipschitz continuity in Eq. (3) does not uniquely 
































Algorithm 1 Batch Bayesian Optimization with Lo¬ 
cal Penalization (BBO-LP). 

Input: dataset Vx = {x^,7/^}”^;^, batch size nt, it¬ 
eration budget m, acquisition transformation g. 
for t = 1 to m do 
Fit a GP to Vt. 

Build the acquisition function a{'K,2t,o) using the 
current GP. 

L ^ uiaxx ||Aiv(x)||. 

for j = 1 to Tib do 

1. M-step: ^ argmax 2 ,g;t {dtj_i(x)}. 

2. P-step: Q;tj(x) ^ cttp W 11^=1 
end for 

Bt” ^ {xqi, ■ • ■ ,Xt,„J. 

yt,i, ■ ■ ■, yt,nb ^ Parallel evaluations of / at 
Pt+i ^ A U {{xtj,yt,j)}]Lx- 

end for 

Fit GP to Vn- 

Returns: xm = argmaxa;^;^ {/r(x)}. 


identify L. In the BO penalization context, small but 
feasible values of L are preferred, because they produce 
large exclusion zones and thus more efficient search. 
Given access to the true objective /, one can show that 
Lv = maxxeA” |jV/(x)|| is a valid Lipschitz constant 
(see Supp. Material C for further details). Note that 
Lv is the smallest value of L that satisfies the Lipschitz 
condition Eq. 3 in the limit xi —> X 2 in (3). 

We now construct an approximation for Lv. Assum¬ 
ing that / is a draw from a GP with a (at least) twice 
differentiable kernel k, the gradient of / at x* is dis¬ 
tributed as a multivariate Gaussian V/(x*)|X, y, x* ~ 
A/’(/xv(x*), Sy(x*)) with mean vector 

Aiv(x*) = aK„,*(x*)K„ V, 


and covariance matrix 

E^(x*) = - dK„,,(x*)K;'aK„,,(x*)^ 


for K„ = K„ -I- CT^I and where, for i, j = 1,..., d and 




dkxjx*) 
dxP) ’ 




d2A:(x*,x*) 

dx^^'fdxB) 


We choose 


Lgp-lca = max ||/rv(x*)|| 


and call this the Gaussian Process Lipschitz Constant 
Approximation criterion (GP-LCA). Note that this 
definition of Lgp-lca ignores the variance of the gra¬ 
dient, which could be used to identify candidate points 


to improve the approximation of Lv in a Bayesian op¬ 
timization fashion. The supplement contains further 
experiments supporting the quality of this approxima¬ 
tion. See Algorithm 1 for a description of all the steps 
described in this section. 


2.3 Heteroscedastic scenarios 

The use of an unique value of L assumes that the func¬ 
tion to optimize is Lipschitz homocedastic. Although 
this is a typical hypothesis for most BO methods, re¬ 
cent works have pointed out that some real problems 
do not satisfy this condition [Assael et ah, 2014]. It 
is not the goal of this work to analyze this case fur¬ 
ther but, interestingly, the method proposed here can 
be extended to non-Lipschitz cases by replacing L in 
the penalizers (/?(x;xj , L) by a local values of L. For 
instance, a possible approach would be to replace the 
local penalizers by (/?(x;Xj,Lj) where Lj = ||/iv(xj)||. 


2.4 Optimizing the penalized acquisition 
function 


The optimization of (4) can be performed most eas¬ 
ily by any gradient-based method in the log space be¬ 
cause there, the gradients have an additive form. More 
formally, when the transformation used to make the 
acquisition positive is the soft-plus function, g{z) = 
ln(l -|- e^), the gradient of logdt_fc(x;Li^o)) being 
ai_fc(x;L(_o) the penalized acquisition in Eq. (4), is: 

I ga(X;It.o) 

V\nat,k{x,It,o) = ga(x;it.o)) 1 + e“(x;it,o) ’ 

k-1 

Wa{x;Xtp) ■ 

V(/?(x;X(j), 


where Vq:(x;L(^o) is the (assumed known) gradient of 
the original acquisition function and V:^(x;x(j) are 
the gradients of the local penalizers 


Vv3(x;xtj) = 


2L 


^27rcr2(xj) 


■(xi -x) 


See Supp. Materials B for details. 


3 Experimental Section 

This section compares the performance of Algorithm 1 
with the state-of-the-art methods for batch BO. We la¬ 
bel the different methods by means of the batch design 
type followed by the acquisition used: Rand is used 
when the first element in the batch is collected max¬ 
imizing the acquisition and the remaining ones ran¬ 
domly, B and PE denote the exploratory approaches in 



d 

Ub 

El 

UCB 

Rand-EI 

Rand-UCB 

SM-UCB 

B-UCB 


5 



0.32±0.05 

0.31T0.05 

1.86±1.06 

0.56±0.03 

2 

10 

0.31±0.03 

0.32±0.06 

0.65±0.32 

0.79±0.42 

4.40±2.97 

0.59±0.00 


20 



0.67±0.31 

0.75±0.32 

- 

0.57±0.01 


5 



9.19±5.32 

10.59±5.04 

137.2±113.0 

6.01T0.00 

5 

10 

8.84±3.69 

11.89±9.44 

1.74±1.47 

2.20±1.85 

108.7±74.38 

3.77±0.00 


20 



2.18±2.30 

2.76±3.06 

- 

2.53±0.00 


5 



690.5T947.5 

1825±2149 

9e-b04±7e-b04 

2098±0.00 

10 

10 

559.1±1014 

1463±1803 

200.9±455.9 

1149±1830 

9e-b04±le-b05 

857.8±0.00 


20 



639.4±1204 

385.9±642.9 

- 

1656±0.00 


d 

Ub 

PE-UCB 

Pred-EI 

Pred-UCB 

qEI 

LP-EI 

LP-UCB 


5 

0.99±0.74 

0.41±0.15 

0.45±0.16 

1.53±0.86 

0.35±0.11 

0.31T0.06 

2 

10 

0.66±0.29 

1.16±0.70 

1.26±0.81 

3.82±2.09 

0.66±0.48 

0.69±0.51 


20 

0.75±0.44 

1.28±0.93 

1.34±0.77 

- 

0.50T0.21 

0.58±0.21 


5 

123.5±81.43 

10.43±4.88 

11.77±9.44 

15.70±8.90 

11.85±5.68 

10.85±8.08 

5 

10 

120.8±78.56 

9.58±7.85 

11.66±11.48 

17.69±9.04 

3.88±4.15 

1.88±2.46 


20 

98.60±82.60 

8.58±8.13 

10.86±10.89 

- 

6.53±4.12 

1.44±1.93 


5 

2e-b05±2e-b05 

793.0±1226 

1412±3032 

- 

1881±1176 

1194±1428 

10 

10 

6e-b04±8e-b04 

442.6±717.9 

1725±3205 

- 

1042±1562 

100.4±338.7 


20 

5e-b04±4e-b04 

1091±1724 

2231±3110 

- 

1249±1570 

20.75±50.12 


Table 1: Results for the gSobol function across different dimensions, batch sizes and methods. For each algorithm, 
the mean and standard deviation are shown. Best results among the batch methods are highlighted in bold. 
represents that the method could not complete the first iteration within the time budget. The value of / at the 
minimum is always zero. El and UCB represent the Expected improvement and the upper confidence bound 
acquisitions. Rand stands for the random batch collection. SM is the simulating and matching approach. Pred 
is the predictive approach and LP the local penalization method presented in this work. qEI is the multi-point 
expected improvement. 


[Schonlau et ah, 1998] and [Contal et ah, 2013], Pred is 
used in cases when the model is used to generate ‘fake’ 
batch observations as in [Azimi et ah, 2012], SM identi¬ 
fies the simulating and matching method [Azimi et ah, 
2010] and LP stands for our local penalization method. 
The multi-point expected improvement [Chevalier and 
Ginsbourger, 2013] is denoted by qEI. Two acquisi¬ 
tion functions are used: the expected improvement 
(El) defined as a£;/(x;I„) = -|-(/)(u)) (t„(x), 

where u = (Ain(x) - 2 /min)/o-n(x) and $(•), <('(•) are 
the standard Gaussian distribution and density func¬ 
tions respectively and 7/min is the best current location 
and the Upper Confidence Bound (UCB) defined as 
Q:(7 Cb(x; 2„) = /r„(x)-|-«:(j„(x), with «; > 0. The batch 
methods that can be used with an arbitrary acquisi¬ 
tion function are tested using both, with the exception 
of the SM whose implementation is only available with 
the UCB. When used in a sequential setting (for base¬ 
line reference) the El and UCB are referred by their 
acronyms. In total, we use 2 sequential and 10 batch 
methods. To run the B, PE, SM, methods, we use 
the available Matlab code.^ The implementation of 

^http://econtal.perso.math.cnrs.fr/software/. Note 
that an alternative implementation of the GP-B-UCB 


these methods optimize / by searching its optimum 
in a fine grid, which is an advantage computationally 
but a drawback in terms of precision. The qEI was 
taken from the R-package DiceOptim.^ Unless speci¬ 
fied otherwise, the default implemented settings of all 
the previous methods are used. 

We perform: (i) a simulation in which the performance 
of the algorithms is compared for a fixed time bud¬ 
get across different problem dimensions, batch sizes 
and acquisition functions and (ii) a comparison of the 
gained information per second rate in three objective 
functions with different evaluation costs. We always 
minimize the objective, minimizing —/ in examples in 
which the goal is to find maximum of /. In all the 
experiments the exponentiated quadratic (EQ) covari¬ 
ance A:(x,x') = 0exp(— 7 ||x —x'll^), 0,7 > 0 is used in 
the GP, whose parameters are optimized by maximiz¬ 
ing the marginal likelihood from the best of 10 random 
initializations. The results are taken over 20 replicates 
with different initial values. All the simulations were 

code is available at http://www.its.caltech.edu/ tade- 
saut/GPBUCBCode/ but we used the former one for con¬ 
sistency in the comparisons. 

^http://cran.r-project.org/web/packages/DiceOptim 



done on Amazon EC2 servers with Intel Xeon E5-2666 
processors and 2 virtual CPUs except the SVR tuning 
with 16 virtual CPUs. 

3.1 Comparisons in terms of the dimension, 
batch size and acquisition function 

We consider the gSobol function (see Supp. Materials 
D) to compare the above mentioned methods across 
dimensions d = 2,5 ,10 and batch sizes, rib = 5,10, 20. 
Por methods using the UCB, k was fixed to 2, which 
allows us to compare the different batch designs using 
the same acquisition function. For dimension 2, 5 and 
10, we use a time budget of 1, 5 and 10 mins, respec¬ 
tively. Table 3 shows the averaged best value found by 
each algorithm for all the iterations completed within 
the time limit. In general, the batch methods using the 
UCB show a better performance that methods using 
the El, especially in dimensions 5 and 10. The overall 
best technique is the LP-UCB, that achieves the best 
results in 5 of the 9 cases. It is also notable that it ex¬ 
hibits fairly small standard deviations compared with 
the rest of the methods and it is coherent accumulating 
information about the optimum of / in terms of the 
batch size: as rib increases the results are consistently 
better. In dimension 2 and batch size 5, the LP-EI 
is the best method. There are three cases in which 
the LP batch designs are not the most competitive 
(although still providing good results). Exploratory 
approaches works well in low dimensional cases, being 
the B-UCB the best method in two scenarios. 

3.2 Comparisons in terms of the cost to 
evaluate the objective 

We choose three scenarios to compare the algorithms 
in terms of the running time. The examples corre¬ 
spond to three functions that are cheap, moderate and 
expensive to evaluate. More specifically, the first ex¬ 
periment uses a function (Cosines) that is inexpen¬ 
sive to evaluate but quite multi-modal. The second 
experiment is motivated by a wet-lab experimental 
design. We work with a surface that emulates the 
performance of mammalian cells in protein produc¬ 
tion given different gene designs. The function has 
dimension 71 and is is moderately expensive to eval¬ 
uate since it corresponds to the predictive mean of a 
GP trained over 1,500 data instances. The qEI was 
not used in this experiment due to the huge computa¬ 
tional effort required to jointly optimize the batches in 
dimension 71. The third experiment involves the tun¬ 
ing of the three parameters of a support vector regres¬ 
sion (SVR) [Drucker et ah, 1997] in a example with 
45730 instances and 9 continuous attributes [Bache 
and Lichman, 2013]. The objective function is the 
cross-validation error of the model, which is expen¬ 


sive to evaluate due to the amount of data used. See 
Supp. Materials D for further details. We take a batch 
size of rih = 5 for the Cosines function and rib = 10 
for the wet-lab and SVR experiments. We compare 
the averaged best found results in terms of the num¬ 
ber of collected batches and the wall-clock time. In 
the last experiment we use the SVR implementation 
available in scikit-learn"^ and only the methods imple¬ 
mented in python are used (El, UCB, Rand-EI, Rand- 
UCB, Pred-EI, Pred-UCB, LP-EI and LP-UCB). 

In the Cosines experiment both the sequential El and 
UCB policies achieve the best results during the first 
10 iterations of the algorithms (2 full batches). As the 
algorithms progress, however, a significant improve¬ 
ment is observed by the LP-EI and LP-UCB methods 
in terms of the number iterations and in terms of the 
wall-clock-time. When many points are collected, the 
update of the GP is more expensive and a good batch 
design is able to explore regions that the sequential 
method cannot. The rest of the batch methods, how¬ 
ever, are not able to do this exploration efficiently, 
which leads to poorer results. Similar results are ob¬ 
tained for the wet-lab experiment. The LP-EI and 
LP-UCB are again the most competitive techniques 
improving the rest of the batch methods and the se¬ 
quential policies. The differences are even more sig¬ 
nificant in this scenario. Since / is now more expen¬ 
sive to evaluate, the parallelization of the evaluations 
makes the search much more efficient, specially for the 
LP-UCB method. Regarding the last experiment, the 
cost of evaluating the function dominates the cost of 
designing the batch. In this case the performance of 
the different batch methods is comparable but signif¬ 
icantly better than the sequential policies due to the 
parallel evaluations of /. The results for the three 
functions are coherent with those observed in Section 
3.1 showing that the BBP-LP methods is overall the 
most efficient method for batches collection in BO. 

4 Discussion 

We have investigated a new heuristic for batch BO, 
BBO-LP, that significantly reduces the computational 
burden of non-parallelizable tasks. The resulting 
method can be used with any acquisition function and 
it is able to make fast and appropriate decisions about 
the locations where / should be evaluated. When the 
batch evaluations of / are parallelizable this is an im¬ 
portant advantage, meaning that they don’t lead to 
considerable additional computational overhead. We 
have found other interesting results. In simple scenar¬ 
ios, batch policies based on random exploration work 
reasonably well in terms of the information gained 

■^http://scikit-learn.org/stable/index.html. 




(a) Results for the Cosines function - batch iterations. 



No. collected batches 


(c) Results for the wet-lab function - batch iterations. 



(e) Results for the SVR function - batch iterations. 



Tlme(seconds) 

(b) Results for the Cosines function - running time. 



(d) Results for the wet-lab function - running time. 



(f) Results for the SVR function - running time. 


Figure 3: Results for the three test functions described in Section 3.2. Geometric figures on top of the lines 
represent the moments in which the batches are evaluated. See caption of Table 3 for details on the acronyms. 


per second. When the complexity of the problem in¬ 
creases, however, methods that make use of some in¬ 
formation about / improve the random policy. In par¬ 
ticular, the approach here proposed makes use of the 
Lipschitz continuity of / to model the interaction be¬ 
tween the elements in the batch. In spirit, this is sim¬ 
ilar to use the GP to predict the evaluations of / but, 
in practice, is much more efficient because it avoids the 
re-computation of the GP after every point is selected. 
The limitations of this approach are, however, deter¬ 
mined by the ability to learn correctly a small enough, 
and valid, Lipschitz constant for /. 

One could also wonder whether it is necessary to re¬ 
quire that sample paths from the GP measure on 
/ should be Lipschitz-continuous themselves. This 
would severely restrict the applicability of this notion, 
because the relationship between regularity of the ker¬ 
nel and the sample paths is complicated. Even if the 
kernel is Lipschitz-continuous, sample paths may not 


be Lipschitz [Adler, 1981]. However, our approach 
only tries to model the effect of evaluations on the 
BO objective, not the GP probability measure itself. 
Many BO objectives, in particular the El and UCB, 
are smooth functions of only the sufficient statistics 
(mean and covariance function) of the GP posterior. 
Both the posterior mean and covariance function are 
members of the Reproducing kernel Hilbert space in¬ 
duced by the kernel (i.e. they are weighted sums of 
kernel functions). Thus, if the kernel is Lipschitz, so 
is the acquisition function, even if the GP measure it¬ 
self has non-Lipschitz sample paths. Finally, note that 
our local repulsion criterion naturally suggests a Latin 
square design for the case when no functional values 
have been been acquired. The latin square design is 
widely suggested for this domain [Jones et ah, 1998]. 
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A Proof of Proposition 1 


Then, it holds that 


We compute the explicit form of the penalization func¬ 
tions iy9(x; Xj). The distribution of rj is Gaussian with 
mean [M — /L and variance a\{xj)lL^ by the 

properties of f{x.j). Then we obtain that 


(p{x; Xj) = 1 - p(x e (xj)) 

= l-p(rj > ||xj-x||p) 

= Pirj < \\xj - x||p) 


for 


= p A/'(0,1)< 


L\\xj - x\\p - M + p„(xj) 


= $ 


CT„(Xj) 

L\\xj - x||p - M + p„(xj) 


CT„(Xj) 


= -erfc i-z) 


Z = 


\/ 2 ct 2 (xj-) 


(L||Xj -x|| - M + ■ 


B Optimization of the penalized 
acquisition function 

Under the proposed local penalization method, to se¬ 
lect the fc-th element of the t-th batch requires the 
optimization of the function 


k-l 

dt.fe(x;It,o) = 5(a(x;2't,o)) n‘P(x;xtj), 

j=i 

which can be done by any gradient descend method as 
follows. We fist map the problem into the natural log 
space by observing that 

argmax{Q;t_fe(x,Zt_o)} = argmax{lndt,fc(x,Zt,o)} ■ 


Vln5t,fc(x,2't,o) = [g{a{x]Itfi)) ^ 

, ^ ff(Q;(x;Zf,o))]VQ;(x;Zt,o) 

ctct^x, ) 

k-l 

3 = 1 

where Va(x;I( g) is the (assumed known) gradient of 
the original acquisition function. In cases in which the 
acquisition is already positive it is natural to choose 
g{z) = z and the gradient of lndt_fc(x,Xt o) reduces to 

Vlndt_fc(x,Zt,o) = a(x;Z(_o)“^VQ;(x;It,o)-I- 

fc-i 

i=i 

When a(x;Zi^o) is not necessarily positive one can take 
g{z) = exp( 2 ) and the gradient simplifies to 


k-l 

Wlnat^ki^Xtg) = 'Va{x;Xtfi)+'^ (p{x;xtj)~^'Vip{x;xtj). 

i=i 


When g{z) = ln(l -|- e^) the gradient is 


Vln at,k{x,It,o) 


2 ga(X;It.o) 

ln(l + 1 + 

k-l 

Va{x]Itfi) + '^ip{x;xtj)~'^ ■ 
3 = 1 

S/(p{x;xtj). 


C Lipschitz constant approximation 

In this section we elaborate in the approximation of 
the Lipschitz constant. First, we include the follow¬ 
ing proposition that allows to uniquely identify a valid 
value of L. 


Applying the properties of the logarithms we trans¬ 
form the problem into the maximization of 

k-l 

\nat,ki^,It,o) = ln[p(a(x;Z(_o))] -b ^ In [v3(x; xtj)]. 

3 = 1 

The gradient with respect to x is now easy to calculate 
since the problem is in additive form. First, note that 
the gradients of the local penalizers W(p{x;xtj) are 


Proposition 2 Let / : A — )■ IR 6e a L-Lipschitz con¬ 
tinuous function defined on a compact subset A C IR"^. 
Take 

Lp = max||V/(a;)||p, 

where Vf{x) = • • •, ■ Then, Lp is a valid 

Lipschitz constant such that the Lipschitz condition 

\f{xi) - fix2)\ < Lp\\xi - X2\\g, 


Vipix;xt,j) = 


2L 


y/27Tal{xj) llxj - : 


-(xj-x), (7) 


with 


\/ 2 ct^(xj) 


{L\\xj -x|| -M-bAin(Xj)). 


where ^ -b j = 1, holds. 

Proof 1 Using the mean value theorem for every 
Xi,X2 & X there exist a w = Xij 3 x 2 , with ft e (0,1) 
such that, 


z = 


\f{xi) - f{x2)\ = \\/f{w){xi - X2)\. 



By the Holder’s inequality we have that 

\f{xi) - f{x2)\ < ||V/(w;)||p||a;i - X2\\q. 

Since w G X by definition, we have that 

\f{xi) - f{x2)\ < Lp\\xi - X2\\q, 
for Lp = maxxfzx ||V/(a;)||p. 

In order to test the empirical approximation of the 
Lipschitz constant detailed in Section 2.2 we use the 
Cosines function described in the experimental section 
of this work. The true Lv for this function is 8.808636, 
that was calculated by maximizing the norm of gradi¬ 
ent of / in a very fine grid. We check the quality of 
our approximation to Lv for increasing sample size up 
to 50 observations, where the locations of the points 
are randomly selected along the domain of / using a 
bivariate uniform distribution. The evaluations of / 
at the selected locations were perturbed with Gaus¬ 
sian noise with standard deviations a = 0,0.1,0.25. 
In Figure 4 we show the results for 30 replicates of 
the experiment. The average approximation of L con¬ 
verges to the true Lv, being this convergence slower 
when the evaluation errors increase. 



Figure 4: Approximation of the Lipschitz constant in 
the cosines function using the GP-LCA method. We 
compare the convergence in the approximation for dif¬ 
ferent noise levels and increasing sample size. For each 
sample size show we show the average of 30 replica¬ 
tions. Vertical bars represent the 95% confidence in¬ 
terval for the average estimate. 


D Detailed description of the 
experiments 

D.l Synthetic functions 

Table 2 contains the the details of the functions used 
in the experiments of this work. 


Name 

Function 

A 

gSobol 

Cosines 

/(x) = nti 

/(x) = 1 - Ei=i(5(2;z) - r{xi)) 
with g{xi) = (1.6a:i — 0.5)^, 
r{xi) = 0.3cos(37r(1.6a:i — 0.5)). 

[-5,5]^ 

[0,5]2 

Table 2: 

Functions used in the experimental section. 


All the parameters ai of the gSobol function are set to 
Oi = 1 in the experiments. 

D.2 Gene design experiment 

There is an increasing interest in the pharmacological 
industry in the the design of synthetic genes capable 
of transforming cells into ‘factories’ able to produce 
drugs of interest. In this experiment we emulate a 
gene design process. 

The function to maximize is the production of cell pro¬ 
teins, that it is known depends on certain features 
of the gene sequences. We built a GP to link gene 
features and protein production efficiency based the 
model described in Gonzalez et al. [2014]. A total of 
71 gene features are considered, which correspond to 
the dimension of the final design space. We validated 
the model with the remaining 2908 genes of the dataset 
and we used its posterior mean as the function to op¬ 
timize. We can understand this model as a mathemat¬ 
ical surrogate of the cell behavior in which the mean 
evaluations play the role of physical wet-lab gene de¬ 
sign tests, many of which can be run in parallel for the 
same price of one. 

D.3 SVR parameter tuning experiment 

Support Vector Machines (SVR) for regression 
Drucker et al. [1997] with an EQ kernel, depend on 
three parameters: the kernel lengthscale ( 7 ), the soft 
margin parameter {C) and the band size (e). A proper 
choice of the parameters is crucial to guarantee a good 
performance of the SVR, which is typically done by 
minimizing the mean square error (RMSE) in a test 
dataset. This task can be expensive, specially for large 
datasets. We use BO to optimize the parameters of the 
SVR using the ‘Physiochemical’ properties of protein 
tertiary structure’ dataset available in the UGI Ma¬ 
chine Learning repository Bache and Lichman [2013]. 
This dataset has 45,730 instances and 9 continuous at¬ 
tributes that are used to predict the coordinate root 
mean square distance (RMSD), a measure that de¬ 
scribes the distance per residue between to optimally 
aligned protein sequences. We trained the SVR using 
a randomly selected subset of 22,000 proteins and we 
tested the results of using the rest. Every iteration 
takes around 300 seconds. 


































